1 Description of the IODS course

This course is about open data science, and the purpose is to get familiar with open data, open science and data science. If you are interested in these topics, here is some additional information: open data, open access and data science.

1.1 Tools and rules

We are going to use the R language for statistical computing; see R as a tool. It is also worth mentioning that the whole course can be completed on the web, which is why it is called a MOOC.

You can find my GitHub repository here.


2 Regression and model validation

 

In this section I’m going to analyze a data set of students who participated in the course Introduction to Social Statistics at the University of Helsinki in fall 2014. These 183 students were asked to take part in a survey about their learning approaches, covering deep, strategic and surface learning orientations. In addition, the students were asked to evaluate their attitude towards statistics.

You can find some basic information about the variables here.

 

2.1 Introducing the data set

 

I’m going to use part of the above-mentioned data, which still contains the different learning orientations and the attitude towards statistics. In addition, the data variables include the students’ age, gender and course exam points.

 

Here you can find the data set I’m going to use in this analysis; it is the same address as in the code below.

students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")

 

Next I’m going to explore what kind of material the students2014 data set contains.

As you can see below, there are 166 observations (i.e. students) and 7 variables. This suggests that not all participants of the course took part in the survey, as there were initially 183 students. The available variables include the age and gender of the students; gender is coded as F for female and M for male.

The attitude and learning orientations were measured on a Likert scale from 1 to 5. The attitude towards statistics is a sum of 10 questions, the deep orientation of 12, the strategic of 8 and the surface orientation of 12 questions. Last but not least, the data contains the exam points of the course, i.e. the exam results.

Under the structure table you can see that there are no missing values (FALSE = 1162). This is useful information for the regression modeling, as we don’t have to run a missing value analysis.

 

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
table(is.na(students2014))
## 
## FALSE 
##  1162

 

2.2 Summaries and graphs

 

Underneath you can see the summaries and histograms of the variables. There was no point in printing a histogram of gender, since gender is not a continuous variable. Instead I was curious to see whether there is any variation between the genders on our variables, so I formed boxplots to illustrate this. Additionally, I printed normal QQ-plots to examine the normality of the distributions, as multivariate normality is one of the assumptions of linear regression. Click to see additional information about the assumptions of linear regression.

The data contains 110 (66 %) women and 56 (34 %) men. The students’ ages range between 17 and 55, with a mean of around 25 and a median of 22. From the boxplot you can see that the average age of the women is lower than that of the men. The histogram of age shows some skewness in the distribution, which the QQ-plot demonstrates as well.

As mentioned earlier, the attitude towards statistics and the learning orientations were measured on a Likert scale from 1 to 5, and the variables are sums of multiple items. Such variables are more likely to be normally distributed because of the central limit theorem. You can see here, too, that all of these sums are fairly normally distributed.

If you look at the attitude towards statistics, you can see that the mean is around 3. Interestingly, the male students seem to have a somewhat higher mean than the female students, but this doesn’t affect the overall mean much because the group of men is noticeably smaller.

When examining the learning orientation approaches, you can see that the deep learning orientation seems predominant, with a mean around 3.7, whereas the surface orientation is lower, with an overall mean around 2.8. There doesn’t seem to be much gender difference in deep orientation, even if the boxplot shows a slightly larger mean for males than for females; however, we can’t be sure about the statistical significance of that difference without testing it. The boxplot of surface orientation, by contrast, implies there could be some differences between the genders: the mean of surface orientation seems lower among men, but the variation seems larger compared with the women. As for the strategic learning approach, the summary below shows an overall mean around 3.1, implying it is more popular than surface orientation but less popular than deep orientation. Surprisingly (for me), the strategic approach seems to be more popular among female than male students.

The exam points vary to some extent: they range between 7 and 33, with a mean around 23. Check also the boxplot, which implies that the men earned higher points than the women, even though the sample means don’t deviate prominently. The histogram of the points makes me wonder whether we can assume normality, but the QQ-plot suggests that the deviation isn’t that remarkable.
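The group differences eyeballed above from the boxplots can also be checked numerically. As a quick sketch (reading the data in with the same address as elsewhere in this diary), aggregate() prints the per-gender means of every numeric variable:

```r
# Read the data as earlier in this diary
students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")

# Mean of each numeric variable by gender
aggregate(cbind(age, attitude, deep, stra, surf, points) ~ gender,
          data = students2014, FUN = mean)
```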

 

SUMMARY

summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

 

STUDENTS’ AGE

library(ggplot2)
qplot(students2014$age, geom = "histogram", binwidth = 1, main = "Histogram for students' age", xlab = "age", col = I("grey"), fill = I("chartreuse3"))

qqnorm(students2014$age)

qplot(gender, age, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and age")

 

STUDENTS’ ATTITUDE TOWARDS STATISTICS

qplot(students2014$attitude, geom = "histogram", binwidth = 0.3, main = "Histogram for attitudes towards statistics", xlab = "attitude", col = I("grey"), fill = I("orange"))

qqnorm(students2014$attitude)

qplot(gender, attitude, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and attitude")

 

DEEP LEARNING ORIENTATION

qplot(students2014$deep, geom = "histogram", binwidth = 0.3, main = "Histogram for deep learning orientation", xlab = "deep", col = I("grey"), fill = I("mediumorchid"))

qqnorm(students2014$deep)

qplot(gender, deep, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and deep orientation")

 

STRATEGIC LEARNING ORIENTATION

qplot(students2014$stra, geom = "histogram", binwidth = 0.3, main = "Histogram for strategic learning orientation", xlab = "strategic", col = I("grey"), fill = I("blue"))

qqnorm(students2014$stra)

qplot(gender, stra, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and strategic orientation")

 

SURFACE LEARNING ORIENTATION

qplot(students2014$surf, geom = "histogram", binwidth = 0.3, main = "Histogram for surface learning orientation", xlab = "surface", col = I("grey"), fill = I("coral"))

qqnorm(students2014$surf)

qplot(gender, surf, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and surface orientation")

 

EXAM POINTS

qplot(students2014$points, geom = "histogram", binwidth = 1, main = "Histogram for exam points", xlab = "Exam points", col = I("grey"), fill = I("yellow"))

qqnorm(students2014$points)

qplot(gender, points, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and points")

 

SCATTER PLOTS FOR SEARCHING THE LINEAR RELATIONSHIPS

Next I’m going to print some scatter plots to see whether there could be any linear relationships between the variables explored above, which is another assumption of linear regression. Later in this diary the aim is to formulate a regression model to explain the exam points in particular, so I’m going to concentrate on variables that could have a linear relationship with them.

First you can see the scatter plot matrix for every pair in the data (excluding gender). It seems there could be some relationship between attitude and exam points. Because of this, I printed a scatter plot where you can take a closer look at these variables, coloured by gender. Indeed, the plot implies there could be a linear relationship between the two. We’ll come back to this in the next section.

The pair matrix suggests that, to some extent, there could also be a linear relationship between strategic learning and exam points. But when you take a closer look at the separate scatter plot, the points vary more than they appeared to in the smaller picture (the pair matrix). This is why I wanted to print the separate plots.

In the last matrix you can see a set of scatter plots including the distributions of, and the correlations between, the different variables. The matrix shows, for example, that the correlation between attitude and exam points is .437, whereas the correlation between strategic learning and points is only .146. Thus we have more evidence that attitude could be a better variable than strategic learning to explain the course exam points. We will also see this in the next section.

 

library(ggplot2)
library(GGally)
pairs(students2014[-1], col = students2014$gender, main = "Scatter plot matrix")

 

attitude <- students2014$attitude
points <- students2014$points
gender <- students2014$gender
ggplot(students2014, aes(x = attitude, y = points, col = gender)) + geom_point() + xlab("Attitudes towards statistics") + ylab("Exam points") + ggtitle("Attitudes vs. Exam points") + geom_smooth(method = "lm")

 

stra <- students2014$stra
ggplot(students2014, aes(x = stra, y = points, col = gender)) + geom_point() + xlab("Strategic learning orientation") + ylab("Exam points") + ggtitle("Strategic learning vs. Exam points") + geom_smooth(method = "lm")

 

library(ggplot2)
library(GGally)
ggpairs(students2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20))) + ggtitle("Matrix of scatter plots and distributions")

 

2.3 Regression models

 

Now I’m going to choose three variables to explain the exam points and to fit a linear regression model. I’m choosing the variables based on the highest correlations with exam points: the explanatory variables will be attitude (r = .437), strategic learning (r = .146) and surface approach (r = -.144). Note that the earlier examination of correlations suggested that the linear relationship between strategic learning and exam points might be negligible.

 

REGRESSION MODEL 1

Below you can see that the overall model is significant by the F-test: F(3, 162) = 14.13, p < .001. However, there are problems with the explanatory variables stra (strategic learning) and surf (surface learning), whose p-values exceed .05. This means I have to exclude these variables and either choose other variables from the data set or run the model with attitude only, since it is the only significant variable at this point (p < .001).

I am still curious to try other variables to explain the exam points, even if the correlation matrix suggested there might not be any significant ones left. Let’s take a look at MODEL 2.

 

Regression_model_1 <- lm(points ~ attitude + stra + surf, data = students2014)
summary(Regression_model_1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

 

REGRESSION MODEL 2

Beneath you can see that my hunch was correct: the p-values for age and deep learning orientation are high, in both cases p > .05. It seems that I have to run one more model, with attitude as the only explanatory variable. So let’s move on to the third model.

 

Regression_model_2 <- lm(points ~ attitude + age + deep, data = students2014)
summary(Regression_model_2)
## 
## Call:
## lm(formula = points ~ attitude + age + deep, data = students2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.206  -3.430   0.204   3.979  10.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.60773    3.38966   4.605 8.32e-06 ***
## attitude     3.59413    0.56959   6.310 2.56e-09 ***
## age         -0.07716    0.05322  -1.450    0.149    
## deep        -0.60275    0.75031  -0.803    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared:  0.2043, Adjusted R-squared:  0.1896 
## F-statistic: 13.87 on 3 and 162 DF,  p-value: 4.305e-08

 

REGRESSION MODEL 3

This seems better :).

 

First we can see some values of the residuals, which are the differences between the actual and the predicted values of the exam points, i.e. y - ŷ. We will take a closer look at the residuals after the next section.

Our overall model is significant by the F-test: F(1, 164) = 38.61, p < .001. Attitude is significant as a predictor, or explanatory variable: B = 3.5, p < .001. So if we wanted to make statistical predictions with our model, we would adjust the formula ŷ = a + Bx to ŷ = 11.64 + 3.5x, where a denotes the intercept.

Additionally, we can see that attitude explains around 19 % of the variation in the course exam points. We can interpret this from the multiple R-squared rather than the adjusted R-squared, because we have only one predictor in our model.

In our model we don’t have to worry about multicollinearity since we have only one predictor.

 

Regression_model_3 <- lm(points ~ attitude, data = students2014)
summary(Regression_model_3)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

 

2.4 Interpreting the parameters of regression model 3

 

MODEL PARAMETERS

I interpreted the model parameters briefly in the last section. Here the purpose is to look more closely at the relationship between attitude and exam points, including the multiple R-squared.

A linear regression model with one predictor variable can be expressed with the equation Y = B0 + Bx + e. This is the same equation I used earlier in the form ŷ = a + Bx, but without the residual error e, which is an unmeasured variable. So among our model parameters we have B0 (or a), which is the Y-intercept, and the B-coefficient (beta), which defines the slope of our regression line.

Now we can look at the coefficients in the summary above (model 3). The intercept has a value of around 11.6, which is the value we would predict for the exam points (Y) if the attitude towards statistics were 0. The estimated regression coefficient (B) of attitude is around 3.5, which means that if attitude (x) differed by one unit, the exam points would differ by 3.5 units on average.

Have a look at the scatter plot below. With our equation and estimated coefficients we can work the following example: if a student had evaluated their attitude towards statistics as 3, we would predict their exam points to be 11.6 + 3 * 3.5 = 22.1 on average.
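The same kind of worked prediction can be done with R’s predict() function; a small sketch, refitting the model exactly as in the summary above:

```r
# Refit MODEL 3 from the data used throughout this chapter
students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")
Regression_model_3 <- lm(points ~ attitude, data = students2014)

# Predicted exam points for a student with attitude = 3; with the
# unrounded coefficients this matches the hand calculation up to rounding
predict(Regression_model_3, newdata = data.frame(attitude = 3))
```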

 

ggplot(students2014, aes(x = attitude, y = points)) + geom_point() + xlab("Attitudes towards statistics") + ylab("Exam points") + ggtitle("Attitudes vs. Exam points") + geom_smooth(method = "lm")

 

MULTIPLE R-SQUARED

After evaluating the parameters, we still have the multiple R-squared to explore. As a reminder, in MODEL 3 attitude explained around 19 % of the variation in the course exam points.

We can interpret R-squared as a statistical measure of how close the data are to the fitted regression line. By that standard, our coefficient is somewhat moderate. We can see the same thing in the scatter plot above, where the observations do not all fall on the regression line but are spread around it. In general, the higher the R-squared, the better the model fits the data.
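With a single predictor, the multiple R-squared is simply the squared correlation between the predictor and the response; a quick sketch to verify the roughly 19 % figure:

```r
students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")

# Squared correlation between attitude and points ...
cor(students2014$attitude, students2014$points)^2

# ... equals the multiple R-squared of the one-predictor model
summary(lm(points ~ attitude, data = students2014))$r.squared
```

Squaring the earlier correlation of .437 gives about .191, in line with the multiple R-squared of .1906 in the summary above.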

However, it is noteworthy that a low R-squared value doesn’t automatically mean that our model is bad. For example, in the field of human behaviour R-squared values can be quite low, since many other factors may affect our actions.

There are also some limitations to R-squared. For example, it cannot determine whether the coefficient estimates and predictions are biased. This is why we explore the residual plots in the next section.

 

2.5 Graphical model validation

 

Finally, I’m going to explore the validity of the model assumptions with graphical output: “Residuals vs. Fitted values”, the “Normal QQ-plot” and “Residuals vs. Leverage”.

First we take a look at the Residuals vs. Fitted (i.e. predicted) values plot. Ideally this plot shows no discernible patterns, and the residuals should be centered on zero throughout the range of fitted values. The red line helps to spot any worrying trends. In the first plot there are no larger patterns or trends, suggesting that we can use our regression model.

Before making any final conclusions, let’s take a look at the normal QQ-plot of the residuals. Here we try to determine whether the residuals are sufficiently normally distributed; they should follow the line as closely as possible. Based on the graph there is some deviation from the line, but I think we can assume the residuals are still sufficiently normally distributed.

Finally, we analyse the Residuals vs. Leverage plot, which helps to determine whether there are influential outliers in our regression model. First, there are no outlying values in the upper-right or lower-right corner, considering the scale as well; this suggests there are no influential cases pulling on our regression line, so we can assume the leverage is low. On the other hand, we have to check whether any cases fall outside Cook’s distance, which appears at the bottom of the plot. Some of our cases come close to this line with large negative standardized residuals, which means they could be influential outliers.

 

plot(Regression_model_3, which = c(1, 2, 5))

 

2.6 Conclusions and comments about my own learning

 

The correlation matrices showed that there could be a relationship between the attitude towards statistics and the course exam points; other relationships did not seem as plausible. This became evident after running the three regression models, of which MODEL 3 was the only fully significant one. The final model suggests that the attitude towards statistics explains around 19 % of the variation in the course exam points, implying there may be other meaningful factors that better explain the exam results.

The graphical output of the model validation suggests that there are no major problems in the residual diagnostics, except for some cases that could be influential in terms of Cook’s distance.

I have to say that this journey has been challenging but fun. I’m eager to learn more, especially about R coding. If you had asked me some years ago whether I would do any coding in the future, I would have said no. But here we are, thanks to this new open data science course and our fantastic docent and assistants. So thank you!

 


3 Logistic Regression

 

The purpose of this chapter is to get familiar with logistic regression.

 

3.1 Data description

 

I’m going to use the Student alcohol consumption dataset from the UCI Machine Learning Repository. The purpose of the original data was to predict secondary school students’ alcohol consumption in Portugal. The material contains two separate datasets, from math and Portuguese language courses, but note that the questionnaires were identical. There are also students who answered both questionnaires.

 

=====================================

You can find more detailed information about those two datasets here.

Data reference: Using Data Mining To Predict Secondary School Student Alcohol Consumption. Fabio Pagnotta, Hossain Mohammad Amran. Department of Computer Science,University of Camerino.

======================================

 

In this analysis the purpose is to use a combined dataset containing the students who answered both questionnaires, i.e. students who participated in both the math and the Portuguese language courses. To combine the two datasets we used some of the background questions, including:

  • school (student’s school: GP - Gabriel Pereira or MS - Mousinho da Silveira)
  • sex (F - female or M - male)
  • age (student’s age: from 15 to 22)
  • address (U - urban or R - rural)
  • famsize (family size: LE3 - less or equal to 3 or GT3 - greater than 3)
  • Pstatus (parent’s cohabitation status: T - living together or A - apart)
  • Medu (mother’s education: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  • Fedu (father’s education: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  • Mjob (mother’s job: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • Fjob (father’s job: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • reason (reason to choose this school: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  • nursery (attended nursery school: yes or no)
  • internet (Internet access at home: yes or no)

 

======================================

You can find the dataset which is utilized in this analysis here.

======================================

 

Underneath you can see all the variables of the combined dataset, consisting of 382 observations and 35 variables.

The following adjustments have also been made: the variables not used for joining the two datasets have been combined by averaging (including the grade variables). “alc_use” is the average of “Dalc” (workday alcohol consumption) and “Walc” (weekend alcohol consumption), on a scale from 1 (very low) to 5 (very high). “high_use” is TRUE if “alc_use” is higher than 2.

 

joined_alcohol <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header = TRUE)
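As a sanity check, the derived columns described above can presumably be reproduced from the raw Dalc and Walc columns already present in the downloaded data; a sketch:

```r
joined_alcohol <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header = TRUE)

# alc_use should be the mean of workday and weekend consumption,
# and high_use TRUE whenever that mean exceeds 2
alc_use_check  <- (joined_alcohol$Dalc + joined_alcohol$Walc) / 2
high_use_check <- alc_use_check > 2

# Compare the reconstructed columns with the ones in the data
all(alc_use_check == joined_alcohol$alc_use)
all(high_use_check == joined_alcohol$high_use)
```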

 

NAMES OF DATA VARIABLES

colnames(joined_alcohol)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

 

Underneath you can take a closer look at the contents of the data variables. The variables I haven’t introduced yet include:

  • guardian (student’s guardian: ‘mother’, ‘father’ or ‘other’)
  • traveltime (home to school travel time: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  • studytime (weekly study time: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • failures (number of past class failures: n if 1 <= n < 3, else 4)
  • schoolsup (extra educational support: yes or no)
  • famsup (family educational support: yes or no)
  • paid (extra paid classes within the course subject (Math or Portuguese): yes or no)
  • activities (extra-curricular activities: yes or no)
  • higher (wants to take higher education: yes or no)
  • romantic (with a romantic relationship: yes or no)
  • famrel (quality of family relationships: from 1 - very bad to 5 - excellent)
  • freetime (free time after school: from 1 - very low to 5 - very high)
  • goout (going out with friends: from 1 - very low to 5 - very high)
  • Dalc (workday alcohol consumption: from 1 - very low to 5 - very high)
  • Walc (weekend alcohol consumption: from 1 - very low to 5 - very high)
  • health (current health status: from 1 - very bad to 5 - very good)
  • absences (number of school absences: from 0 to 93)
  • G1 (first period grade: from 0 to 20)
  • G2 (second period grade: from 0 to 20)
  • G3 (final grade: from 0 to 20, output target)

 

STRUCTURE OF THE DATA VARIABLES

str(joined_alcohol)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

 

3.2 My initial hypotheses

 

The purpose of this analysis is to study the relationships between high/low alcohol consumption and four variables of my own interest in the data. I think there are many interesting variables that might have a relationship with alcohol consumption.

 

HYPOTHESIS 1

The first variable will be sex, i.e. female or male. I believe gender has a relationship with alcohol consumption, but it is impossible to say which gender has a higher probability of being a “high user” or a “low user”. One reason is that I’m unfamiliar with Portuguese culture and its habits of alcohol consumption.

 

HYPOTHESIS 2

The second interesting variable will be the weekly study time (studytime), which has the possible outcomes 1 (less than 2 hours), 2 (2 to 5 hours), 3 (5 to 10 hours) or 4 (more than 10 hours).

I believe that weekly study time and alcohol consumption are related such that the less a student studies, the higher the probability of being an alcohol “high user”, and vice versa.

 

HYPOTHESIS 3

The third variable I’m going to choose is goout, i.e. going out with friends. This variable has possible values from 1 (very low) to 5 (very high).

The goout variable doesn’t tell us what kinds of activities going out consists of, but I believe that the more one goes out, the higher the probability of using alcohol. That is, the more a student goes out, the greater the probability of being an alcohol “high user”, and vice versa.

 

HYPOTHESIS 4

The last variable that interests me is famrel, the quality of family relationships, which ranges from 1 (very bad) to 5 (excellent).

I believe there can be a relationship between family relationships and alcohol consumption such that the worse these relationships are, the higher the probability of being a “high user” of alcohol. In the bigger picture, another relevant aspect could be the overall importance of these relationships, which may determine whether they affect the student’s alcohol consumption at all.

 

3.3 Relationships between the variables in terms of my initial hypotheses

 

In this section I’m going to explore the distributions of the variables introduced in the previous section. The second purpose is to explore the possible relationships between alcohol consumption and the chosen variables.

 

library(tidyr); library(dplyr); library(ggplot2); library(knitr)

 

SEX AND ALCOHOL CONSUMPTION.

As you can see from the table below, there are 198 females and 184 males in the dataset, so the groups are nearly the same size. In total there are 382 students in the dataset.

 

sex <- joined_alcohol$sex
alc_use <- joined_alcohol$alc_use
high_use <- joined_alcohol$high_use
joined_alcohol %>% count(sex) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 2 × 3
##      sex     n     freq
##   <fctr> <int>    <dbl>
## 1      F   198 51.83246
## 2      M   184 48.16754

 

The table below shows the number of students whose alcohol consumption (alc_use) is at most 2 versus higher than 2. In the former case high_use is FALSE and in the latter it is TRUE. Remember that the scale varies from 1 (very low) to 5 (very high).

The table also compares alcohol consumption between genders, and it suggests that there are somewhat more “high users” among men (n = 71) than among women (n = 41). Conversely, there are somewhat more “low users” among women (n = 157) than among men (n = 113). These results suggest that there could be a relationship between sex and alcohol consumption.
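As a reminder of how the high_use indicator is defined, here is a minimal sketch with made-up example data. I’m assuming, as in this course’s data wrangling, that alc_use is the mean of the weekday (Dalc) and weekend (Walc) consumption columns; the threshold of 2 is the one used throughout this analysis.

```r
# A sketch with toy data: alc_use is assumed to be the mean of weekday (Dalc)
# and weekend (Walc) consumption; high_use is TRUE when alc_use exceeds 2.
library(dplyr)
example <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
example <- example %>%
  mutate(alc_use = (Dalc + Walc) / 2,
         high_use = alc_use > 2)
example$high_use  # FALSE TRUE TRUE
```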

 

joined_alcohol %>% count(sex, high_use) %>% mutate(freq = n / sum(n = 382) * 100)
## Source: local data frame [4 x 4]
## Groups: sex [2]
## 
##      sex high_use     n     freq
##   <fctr>    <lgl> <int>    <dbl>
## 1      F    FALSE   157 41.09948
## 2      F     TRUE    41 10.73298
## 3      M    FALSE   113 29.58115
## 4      M     TRUE    71 18.58639

 

The first bar chart illustrates alcohol consumption between genders, whereas the second chart clarifies the high use of alcohol between genders. Thus you can see the same results as described above.

 

g1 <- ggplot(data = joined_alcohol, aes(x = alc_use, fill = sex))
g1 + geom_bar() + ggtitle("Alcohol consumption between genders")

g2 <- ggplot(data = joined_alcohol, aes(x = high_use, fill = sex))
g2 + geom_bar() + ggtitle("Alcohol high use between genders")

 

WEEKLY STUDYTIME AND ALCOHOL CONSUMPTION.

The first three tables illustrate the weekly studytime in general and between genders. To recall, the measurement scale ranges from 1 to 4, where

  • 1 = less than 2 hours per week
  • 2 = 2 to 5 hours per week
  • 3 = 5 to 10 hours per week
  • 4 = more than 10 hours per week

 

The first table describes the distribution of studytime among the students, and the most common studytime seems to be 2 to 5 hours per week. The same emphasis can be seen in the summary below it, which indicates that the mean studytime is around 2 (2 to 5 hours per week).

The third table divides the mean studytime between genders and suggests that females may spend more time studying than males.

 

studytime <- joined_alcohol$studytime
joined_alcohol %>% count(studytime) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 4 × 3
##   studytime     n      freq
##       <int> <int>     <dbl>
## 1         1   103 26.963351
## 2         2   190 49.738220
## 3         3    62 16.230366
## 4         4    27  7.068063
summary(studytime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.034   2.000   4.000
joined_alcohol %>% group_by(sex) %>% summarise(count = n(), mean = mean(studytime))
## # A tibble: 2 × 3
##      sex count     mean
##   <fctr> <int>    <dbl>
## 1      F   198 2.272727
## 2      M   184 1.777174

 

As before, the table below shows the number of students whose alcohol consumption is at most 2 (high_use FALSE) versus higher than 2 (high_use TRUE).

The table also compares the high use of alcohol between the different amounts of studytime. If you look at the frequencies (%), which are computed within each studytime category, you can see that the frequencies of high use are somewhat lower in categories 3 and 4 and somewhat higher in categories 1 and 2. This may support my hypothesis 2, where I suggested that the less a student studies, the higher the probability of being a high user of alcohol.

 

joined_alcohol %>% count(studytime, high_use) %>% mutate(freq = n / sum(n) * 100)
## Source: local data frame [8 x 4]
## Groups: studytime [4]
## 
##   studytime high_use     n     freq
##       <int>    <lgl> <int>    <dbl>
## 1         1    FALSE    60 58.25243
## 2         1     TRUE    43 41.74757
## 3         2    FALSE   133 70.00000
## 4         2     TRUE    57 30.00000
## 5         3    FALSE    54 87.09677
## 6         3     TRUE     8 12.90323
## 7         4    FALSE    23 85.18519
## 8         4     TRUE     4 14.81481

 

The boxplot below compares the high alcohol consumption to students’ weekly studytime between genders.

 

g2 <- ggplot(data = joined_alcohol, aes(x = high_use, y = studytime, col = sex))
g2 + geom_boxplot() + ggtitle("High use consumption vs weekly studytime")

 

GOING OUT WITH FRIENDS AND ALCOHOL CONSUMPTION.

The first three tables illustrate the variable of going out with friends. The measurement scale varies from 1 (very low) to 5 (very high), and I assume it measures how often the student goes out.

The first table describes the distribution of going out among the students, and the summary indicates that the mean is around 3.

The third table divides the mean of going out between genders and suggests that there might not be much difference between them.

 

goout <- joined_alcohol$goout
joined_alcohol %>% count(goout) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 5 × 3
##   goout     n      freq
##   <int> <int>     <dbl>
## 1     1    24  6.282723
## 2     2    99 25.916230
## 3     3   123 32.198953
## 4     4    82 21.465969
## 5     5    54 14.136126
summary(goout)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.113   4.000   5.000
joined_alcohol %>% group_by(sex) %>% summarise(count = n(), mean = mean(goout))
## # A tibble: 2 × 3
##      sex count     mean
##   <fctr> <int>    <dbl>
## 1      F   198 3.045455
## 2      M   184 3.184783

 

As before, the table below shows the number of students whose alcohol consumption is at most 2 (high_use FALSE) versus higher than 2 (high_use TRUE).

The table compares the high use of alcohol between the different frequencies of going out with friends (i.e. not going out often vs. going out often). If you look at the column of frequencies (%), which are computed within each going-out category, you can see that the relative frequency of high alcohol use grows as the student goes out more with her/his friends. This may support my hypothesis 3, where I suggested that the more a student goes out, the greater the probability of being an alcohol high user, and vice versa.

 

joined_alcohol %>% count(goout, high_use) %>% mutate(freq = n / sum(n) * 100)
## Source: local data frame [10 x 4]
## Groups: goout [5]
## 
##    goout high_use     n     freq
##    <int>    <lgl> <int>    <dbl>
## 1      1    FALSE    21 87.50000
## 2      1     TRUE     3 12.50000
## 3      2    FALSE    84 84.84848
## 4      2     TRUE    15 15.15152
## 5      3    FALSE   100 81.30081
## 6      3     TRUE    23 18.69919
## 7      4    FALSE    43 52.43902
## 8      4     TRUE    39 47.56098
## 9      5    FALSE    22 40.74074
## 10     5     TRUE    32 59.25926

 

The boxplot below compares high alcohol consumption to the frequency of going out with friends, by gender.

 

g3 <- ggplot(data = joined_alcohol, aes(x = high_use, y = goout, col = sex))
g3 + geom_boxplot() + ggtitle("High use consumption vs going out with friends")

 

FAMILY RELATIONSHIPS AND ALCOHOL CONSUMPTION.

The first three tables illustrate the variable of quality of family relationships. This variable is measured from 1 (very bad) to 5 (excellent).

The first table describes the counts in each quality category, and it suggests that the family relationships are fairly good on average. Accordingly, the summary below it indicates that the mean quality is around 4, which on this scale corresponds to good family relationships.

The third table divides the above-mentioned mean into categories by gender. These results imply that there might not be any difference between genders in the quality of family relationships.

 

famrel <- joined_alcohol$famrel
joined_alcohol %>% count(famrel) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 5 × 3
##   famrel     n      freq
##    <int> <int>     <dbl>
## 1      1     9  2.356021
## 2      2    18  4.712042
## 3      3    66 17.277487
## 4      4   183 47.905759
## 5      5   106 27.748691
summary(famrel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    4.00    3.94    5.00    5.00
joined_alcohol %>% group_by(sex) %>% summarise(count = n(), mean = mean(famrel))
## # A tibble: 2 × 3
##      sex count     mean
##   <fctr> <int>    <dbl>
## 1      F   198 3.878788
## 2      M   184 4.005435

 

As before, the table below shows the number of students whose alcohol consumption is at most 2 (high_use FALSE) versus higher than 2 (high_use TRUE).

The table compares the high use of alcohol between the different qualities of family relationships. Again, the frequencies (%) are computed within each category, which makes it easier to compare relative frequencies between the quality categories.

It seems that high use is most common in the middle categories, between bad and average family relationships, and the extreme values (very bad and excellent) don’t seem to differ much from each other. Thus these results may not support my hypothesis 4, where I suggested that the worse the relationships are, the higher the probability of being an alcohol high user.

 

joined_alcohol %>% count(famrel, high_use) %>% mutate(freq = n / sum(n) * 100)
## Source: local data frame [10 x 4]
## Groups: famrel [5]
## 
##    famrel high_use     n     freq
##     <int>    <lgl> <int>    <dbl>
## 1       1    FALSE     7 77.77778
## 2       1     TRUE     2 22.22222
## 3       2    FALSE     9 50.00000
## 4       2     TRUE     9 50.00000
## 5       3    FALSE    40 60.60606
## 6       3     TRUE    26 39.39394
## 7       4    FALSE   131 71.58470
## 8       4     TRUE    52 28.41530
## 9       5    FALSE    83 78.30189
## 10      5     TRUE    23 21.69811

 

The boxplot below compares high alcohol consumption to the quality of family relationships, by gender.

 

g4 <- ggplot(data = joined_alcohol, aes(x = high_use, y = famrel, col = sex))
g4 + geom_boxplot() + ggtitle("High use consumption vs family relationships")

 

3.4 Logistic regression model

 

In this chapter the purpose is to use logistic regression to find out whether the previously chosen variables are related to high alcohol consumption and whether the hypotheses suggested earlier hold up.

Before fitting the model I will check whether there are any missing values in the combined alcohol consumption data. The table below indicates that there are none, since all 13,370 cells return FALSE.

 

TABLE OF THE MISSING VALUES

table(is.na(joined_alcohol))
## 
## FALSE 
## 13370

 

SUMMARY OF THE FITTED MODEL

My model contains a categorical variable (“sex”) and continuous variables (the others) as independent variables.

First we can see that all the coefficients of the independent variables are significant. The sex variable suggests a significant association between the student’s sex and the probability of high alcohol consumption. The positive coefficient for this predictor suggests that, all other variables being equal, a male student is more likely to be a high user of alcohol than a female student. In my first hypothesis I suggested that sex might have a relationship with alcohol consumption, but I couldn’t say in advance whether there are differences between genders.

Then we have the significant variable of weekly studytime, suggesting an association between studytime and the probability of high alcohol use. The negative coefficient for this predictor implies that as studytime increases, the probability (log-odds) of being a high user of alcohol decreases, all other variables held constant. This result corresponds with my second hypothesis, where I suggested that the less a student studies, the higher the probability of being a high user, and vice versa.

The variable of going out with friends (“goout”) has the lowest p-value, suggesting a strong association between going out and the probability of being an alcohol high user. The positive coefficient suggests that the more one goes out, the higher the probability (log-odds) of being a high user, all other variables held constant. This corresponds with my third hypothesis, where I suggested that the more a student goes out with her/his friends, the higher the probability of being a high user, and vice versa.

The last variable is famrel, which represents the quality of family relationships. The significance of this factor suggests an association between family relationships and the probability of being an alcohol high user. The negative coefficient implies that the better these relationships are, the lower the probability (log-odds) of being a high user of alcohol. Thus, somewhat surprisingly, this corresponds with my fourth hypothesis. The surprising part comes from chapter 3.3: after exploring the data structure of alcohol high use together with family relationships and reading the summaries, I started to doubt the possible relationship between these variables.

 

my_model <- glm(high_use ~ sex + studytime + goout + famrel, data = joined_alcohol, family = "binomial")
summary(my_model)
## 
## Call:
## glm(formula = high_use ~ sex + studytime + goout + famrel, family = "binomial", 
##     data = joined_alcohol)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5928  -0.7494  -0.4919   0.8250   2.7080  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2112     0.7142  -1.696  0.08991 .  
## sexM          0.7593     0.2655   2.860  0.00424 ** 
## studytime    -0.4752     0.1703  -2.790  0.00527 ** 
## goout         0.7947     0.1218   6.523 6.89e-11 ***
## famrel       -0.4496     0.1379  -3.260  0.00111 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 379.27  on 377  degrees of freedom
## AIC: 389.27
## 
## Number of Fisher Scoring iterations: 4

 

ODDS RATIOS

Here the coefficients of the above regression model are exponentiated, and therefore we can interpret them as odds ratios (OR). Additionally, the table below contains the confidence intervals (CI) of the odds ratios.

 

The confidence intervals in the table below suggest that the results of my logistic regression model can be considered significant: none of the predictor intervals includes 1. For sexM and goout the whole interval lies above 1, and for studytime and famrel it lies below 1.

OR <- coef(my_model) %>% exp
CI <- confint(my_model) %>% exp
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2978484 0.07158146 1.1881189
## sexM        2.1368457 1.27450527 3.6171734
## studytime   0.6217330 0.44091699 0.8617913
## goout       2.2138196 1.75581193 2.8339395
## famrel      0.6378742 0.48488876 0.8343189
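To make the odds ratios concrete, here is a small sketch using the sexM coefficient from the summary above and a hypothetical baseline probability of 0.20 (a made-up figure, chosen only for illustration). It shows that an odds ratio multiplies odds, not probabilities:

```r
# Odds ratio for sexM, computed from the fitted coefficient above
OR_sexM <- exp(0.7593)          # about 2.14

# A hypothetical female student with a 0.20 probability of high use:
odds_f <- 0.20 / (1 - 0.20)     # odds = p / (1 - p) = 0.25
odds_m <- odds_f * OR_sexM      # comparable male odds, other things equal
p_m <- odds_m / (1 + odds_m)    # back to a probability, about 0.35
```

So roughly doubling the odds raises the probability from 0.20 to about 0.35 in this example, not to 0.40.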

 

3.5 The predictive power of my logistic model

 

probabilities <- predict(my_model, type = "response")
joined_alcohol <- mutate(joined_alcohol, probability = probabilities)
joined_alcohol <- mutate(joined_alcohol, prediction = probability > 0.5)

table(high_use, prediction = probabilities > 0.5)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   26
##    TRUE     66   46

 

library(dplyr)
g5 <- ggplot(joined_alcohol, aes(x = probability, y = high_use, col = prediction))
g5 + geom_point()

 

table(high_use, prediction = joined_alcohol$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   26
##    TRUE     66   46
table(high_use, prediction = joined_alcohol$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.63874346 0.06806283
##    TRUE  0.17277487 0.12041885
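The cross tabulations above can be condensed into a single number, the training error, i.e. the share of misclassified students. A minimal sketch using the counts copied from the confusion matrix above:

```r
# Confusion matrix counts copied from the table above
conf <- matrix(c(244, 66, 26, 46), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

# Misclassified: predicted TRUE when actually FALSE (26),
# or predicted FALSE when actually TRUE (66)
training_error <- (conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)
training_error  # about 0.24, i.e. roughly 24 % of the students are misclassified
```

The same number could be computed directly by comparing high_use to the predicted probabilities; the matrix form just keeps the sketch self-contained.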

 


4 Clustering and classification

 

4.1 Brief description of the dataset

 

In this chapter the purpose is to get familiar with clustering and classification. I’m going to use a dataset called Boston, which contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The data was originally published in Harrison, D. and Rubinfeld, D.L., “Hedonic prices and the demand for clean air”, J. Environ. Economics & Management, vol. 5, 81-102, 1978. You can find some additional information here and here.

This dataset ships with R (in the MASS package), so let’s take a look at some details.

 

DATA VARIABLES

  • crim - per capita crime rate by town
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox - nitrogen oxides concentration (parts per 10 million)
  • rm - average number of rooms per dwelling
  • age - proportion of owner-occupied units built prior to 1940
  • dis - weighted mean of distances to five Boston employment centres
  • rad - index of accessibility to radial highways
  • tax - full-value property-tax rate per $10,000
  • ptratio - pupil-teacher ratio by town
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat - lower status of the population (percent)
  • medv - median value of owner-occupied homes in $1000s

 

STRUCTURE OF THE BOSTON DATASET

From the structure table you can notice that there are 506 observations and 14 variables: 13 continuous variables and one binary variable, “chas”.

 

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

 

4.2 Summaries and graphs

 

In this section I’m going to explore the distributions of the data variables further. First I show you a summary table containing the distribution of each variable, divided into quartiles together with the minimum and maximum values.

After this I’ll show you the histograms of these variables, excluding chas and black. Chas is a dummy variable consisting of the values 0 and 1. My reason for dropping black is ethical; remember that this dataset is already “long in the tooth”. From the histograms you can get a better understanding of the variation of each variable.

At the end of this chapter I’ll explore the correlations between the variables and demonstrate them by means of the correlation matrix table and the correlation matrix plot.

 

SUMMARY TABLE OF DATA VARIABLES

From the summary table below you can see that the distributions vary across very different scales.

 

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

 

HISTOGRAMS OF CRIM, ZN, INDUS AND NOX

library(ggplot2); library(gridExtra);library(dygraphs)

# INITIALIZING THE PLOTS

# Histogram of crim
hcrim <- qplot(Boston$crim, geom = "histogram", binwidth = 6, main = "Crime rate by town", xlab = "crime rate", col = I("grey"), fill = I("chartreuse3"))

# Histogram of zn
hzn <- qplot(Boston$zn, geom = "histogram", binwidth = 6, main = "Prop. of residential land zoned", xlab = "zn", col = I("grey"), fill = I("orange"))

# Histogram of indus
hindus <- qplot(Boston$indus, geom = "histogram", binwidth = 3, main = "Prop. of non-retail business", xlab = "indus", col = I("grey"), fill = I("mediumorchid"))

hnox <- qplot(Boston$nox, geom = "histogram", binwidth = 0.02, main = "Nitrogen oxides concentration", xlab = "nox", col = I("grey"), fill = I("blue"))


# Combining the plots using the function grid.arrange() [in gridExtra] 
  
# Additional information from (http://www.sthda.com/english/wiki/ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page-r-software-and-data-visualization)

grid.arrange(hcrim, hzn, hindus, hnox, ncol=2, nrow =2)

 

HISTOGRAMS OF RM, AGE, DIS AND RAD

library(ggplot2); library(gridExtra); library(dygraphs)

# INITIALIZING THE PLOTS

hrm <- qplot(Boston$rm, geom = "histogram", binwidth = 0.5, main = "Number of rooms", xlab = "rooms", col = I("grey"), fill = I("yellow"))

hage <- qplot(Boston$age, geom = "histogram", binwidth = 10, main = "Owner-occupied units", xlab = "age of the unit", col = I("grey"), fill = I("chocolate"))

hdis <- qplot(Boston$dis, geom = "histogram", binwidth = 1, main = "distances to 5 employment centres", xlab = "distance", col = I("grey"), fill = I("cornflowerblue"))

hrad <- qplot(Boston$rad, geom = "histogram", binwidth = 1, main = "Access to radial highways", xlab = "radial highways", col = I("grey"), fill = I("coral"))

# Combining the plots using the function grid.arrange() [in gridExtra] 

grid.arrange(hrm, hage, hdis, hrad, ncol=2, nrow =2)

 

HISTOGRAMS OF TAX, PTRATIO, LSTAT AND MEDV

library(ggplot2); library(gridExtra); library(dygraphs)

# INITIALIZING THE PLOTS

htax <- qplot(Boston$tax, geom = "histogram", binwidth = 30, main = "Property tax rate per $10,000", xlab = "tax rate", col = I("grey"), fill = I("red"))

hptratio <- qplot(Boston$ptratio, geom = "histogram", binwidth = 1, main = "Pupil-teacher ratio by town", xlab = "pupil-teacher ratio", col = I("grey"), fill = I("purple"))

hlstat <- qplot(Boston$lstat, geom = "histogram", binwidth = 2, main = "Lower status of the population %", xlab = "lower status", col = I("grey"), fill = I("palegreen4"))

hmedv <- qplot(Boston$medv, geom = "histogram", binwidth = 4, main = "Owner-occupied homes in $1000s", xlab = "Owner-occupied homes", col = I("grey"), fill = I("violetred2"))

# Combining the plots using the function grid.arrange() [in gridExtra] 

grid.arrange(htax, hptratio, hlstat, hmedv, ncol=2, nrow =2)



CORRELATION MATRIX BETWEEN DATA VARIABLES

library(tidyverse)
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

 

PLOTTING THE CORRELATION MATRIX

I used corrplot() to better demonstrate the correlation matrix above. In the plot below, positive correlations are displayed in blue and negative correlations in red. Colour intensity and circle size are proportional to the correlation coefficients, and the legend shows the coefficients with their corresponding colours. I also added the correlation value of each pair, even though the plot is not as pretty after this.

You can see that the highest correlations are between nox and indus as well as between tax and indus. It seems logical that nitrogen oxide levels and tax rates are higher in industrial areas.

Within the next chapters I’m going to concentrate more on crime rate as a target variable, so it’s also convenient to explore the relationships between crime rate and the other variables here. The highest positive correlations are between crime rate and rad, and between crime rate and tax rate. I’m not sure what these are actually telling us. By this I mean that sometimes variables can correlate without any specific cause-and-effect relationship. Here the crime rate by town correlates positively with accessibility to radial highways, but I can’t figure out any “real” reason why this happens. Maybe the criminals do their crimes in towns from which they can get away easily :D :D

And maybe the positive correlation between crime rate and tax rate indicates that there are more crimes in areas with bigger properties, because bigger properties carry bigger taxes. Also, wealthier people may live in bigger properties, which may be more tempting, for example, to thieves. I’m not sure what kinds of crimes the crime rate contains, but I think you get my point.

Then there are more moderate negative correlations between crime rate and medv, as well as between crime rate and dis. Medv indicates the median value of owner-occupied homes in $1000s, and the correlation suggests that the crime rate is higher where the median home value is lower, and vice versa. I won’t try to interpret this detached from the bigger picture, because there can be several reasons for it. Dis indicates the weighted mean of distances to five Boston employment centres, so the negative correlation tells us that the higher the crime rate, the shorter the mean distance, and vice versa. This may be logical in the sense that there can be more crimes near these centres (shorter distance) compared to the more rural areas (greater distance).

Note that under the corrplot you can find a pair (scatter plot) matrix which illustrates also the relationships between different variables.

 

library(corrplot); library(dygraphs)
corrplot(cor_matrix, method = "circle", main = "Correlation matrix of Boston variables", tl.cex = 1, addCoef.col = "black")

library(ggplot2)
library(GGally)
pairs(Boston[-4], main = "Scatter plot matrix")

 

4.3 Scaling the Boston dataset

 

In this chapter I’m going to scale or standardize the whole dataset.

 

SUMMARIES OF THE SCALED DATA

As you can see from the table below, the distributions of the variables changed after standardizing the dataset. Now the mean of every variable is zero and the observations vary on both sides of zero, which makes the different distributions more comparable for the later steps.

 

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

 

CATEGORIZING THE CRIME RATE

There are a lot of technicalities in this part, so I decided to describe the steps inside the R code chunks. The main point here is to create a new categorical variable based on the scaled crime rate and split it into four categories at its quantiles, labelled “low”, “med_low”, “med_high” and “high”. After this I remove the “original” scaled crime rate from the dataset.

 

# Converting the scaled data into a data frame
boston_scaled <- as.data.frame(boston_scaled)

scaled_crim <- boston_scaled$crim

# Using the quantiles as the break points in the categorical variable of scaled_crim
bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# Creating a categorical variable "crime"
boston_scaled$crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

crime <- boston_scaled$crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# Verifying the content
colnames(boston_scaled)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"   
## [15] "crime"
class(boston_scaled)
## [1] "data.frame"

 

# Dropping the old crime rate from the dataset
library(dplyr)
boston_scaled <- dplyr::select(boston_scaled, -crim)

# Verifying the content
colnames(boston_scaled)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865

 

CREATING THE TRAIN AND TEST SETS

# Choosing randomly 80 % of the rows (note: no seed is set here, so the split varies between runs)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)

# Creating train and test sets
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

# Inspecting the content
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  3.372 -0.487 -0.487 -0.487 0.585 ...
##  $ indus  : num  -1.19 1.015 -0.437 1.015 -0.915 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -1.335 1.599 -0.144 0.978 -1.111 ...
##  $ rm     : num  1.1434 0.251 -0.4763 -1.9621 0.0247 ...
##  $ age    : num  -1.697 0.878 0.477 1.116 -1.292 ...
##  $ dis    : num  1.668 -0.8512 0.0926 -1.2446 0.7625 ...
##  $ rad    : num  -0.982 1.66 -0.637 1.66 -0.637 ...
##  $ tax    : num  -0.731 1.529 -0.601 1.529 -0.755 ...
##  $ ptratio: num  -1.458 0.806 1.175 0.806 0.251 ...
##  $ black  : num  0.417 -3.606 -1.359 0.441 0.441 ...
##  $ lstat  : num  -0.673 0.756 2.109 3.097 -0.831 ...
##  $ medv   : num  1.051 -1.406 -1.015 -0.95 0.247 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 4 3 4 1 2 1 1 3 2 ...
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  0.2845 0.0487 0.0487 -0.4872 -0.4872 ...
##  $ indus  : num  -1.287 -0.476 -0.476 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.265 -0.265 -0.144 -0.144 ...
##  $ rm     : num  0.413 0.131 -0.392 -0.268 -0.419 ...
##  $ age    : num  -0.12 0.914 0.509 0.566 0.466 ...
##  $ dis    : num  0.14 1.212 1.155 0.317 0.22 ...
##  $ rad    : num  -0.982 -0.522 -0.522 -0.637 -0.637 ...
##  $ tax    : num  -0.666 -0.577 -0.577 -0.601 -0.601 ...
##  $ ptratio: num  -1.46 -1.5 -1.5 1.18 1.18 ...
##  $ black  : num  0.441 0.393 0.441 0.256 0.329 ...
##  $ lstat  : num  -1.0745 1.0918 0.0864 -0.3351 0.2824 ...
##  $ medv   : num  0.16 -0.819 -0.395 -0.471 -0.547 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 2 2 3 3 3 2 2 1 2 ...
colnames(train)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"
colnames(test)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"

 

4.4 Linear Discriminant Analysis, LDA

 

Linear discriminant analysis is a classification technique, and it is preferred over logistic regression when the target variable has more than two classes. LDA makes some assumptions about the data: one is that the input variables are normally distributed, and another is that each attribute has the same variance (Reference).

Note that the data was standardized earlier, which addresses the latter assumption.

 

FITTING THE LINEAR DISCRIMINANT ANALYSIS

From the output below we can see, for example, that about 26.2% of the observations in the training set belong to the low crime rate class (the prior probabilities of the groups). The output also provides the group means, which are the averages of each predictor within each class.

LDA defines as many “discriminant functions” as the number of categories of the outcome minus one. Since I have four categories, the LDA output shows three functions under the headline “Proportion of trace”. These functions are ordered by the amount of between-group variance they explain.

 

library(MASS)
lda.fit <- lda(crime~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2450495 0.2400990 0.2524752 
## 
## Group means:
##                  zn       indus         chas        nox         rm
## low       0.9273008 -0.89527070 -0.123759247 -0.8799761  0.4172776
## med_low  -0.1444396 -0.24360599  0.006051757 -0.5414810 -0.1406833
## med_high -0.3891090  0.09574283  0.133557555  0.3445314  0.1463522
## high     -0.4872402  1.01710965 -0.002135914  1.1002224 -0.4521290
##                 age        dis        rad        tax      ptratio
## low      -0.8952872  0.8850345 -0.6871702 -0.7106737 -0.401711974
## med_low  -0.2792209  0.3181166 -0.5549663 -0.4767992 -0.001856142
## med_high  0.3931362 -0.3397152 -0.3981658 -0.3201598 -0.177080044
## high      0.8227129 -0.8671419  1.6382099  1.5141140  0.780871766
##               black       lstat        medv
## low       0.3767806 -0.76124286  0.48038350
## med_low   0.3046130 -0.09271075 -0.01652675
## med_high  0.1085334 -0.01961523  0.17062496
## high     -0.8544820  0.92681002 -0.67182938
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.10223149  0.74422920 -0.83251835
## indus   -0.03656348 -0.13966091  0.56276249
## chas    -0.06782568 -0.05985517  0.11145542
## nox      0.49258992 -0.71223344 -1.52540781
## rm      -0.13741756 -0.11026071 -0.13330158
## age      0.21884930 -0.29741623 -0.17203708
## dis     -0.04334737 -0.20256325  0.13177401
## rad      3.17307059  0.98796115  0.03507097
## tax     -0.03888746 -0.03045103  0.42357712
## ptratio  0.17133986 -0.04566881 -0.26898349
## black   -0.13828129  0.00773070  0.09204342
## lstat    0.24683838 -0.28766836  0.44141652
## medv     0.24369618 -0.38437390 -0.13270258
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9522 0.0351 0.0127

 

# Creating a numeric vector of the train set's crime classes for plotting purposes
classes <- as.numeric(train$crime)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# plotting the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.5)

 

4.5 Predicting the classes with the Linear Discriminant Analysis, LDA

 

LDA makes predictions by estimating the probability that a new set of inputs belongs to each class. The class that gets the highest probability is the output class, and a prediction is made (Reference).

First I’m going to save the four crime classes from the test set to “correct_classes” and then remove the variable from the test set, so that I can later compare the predicted classes of the test data with the correct ones.

 

SAVING THE CRIME CATEGORIES

# Saving the crime classes from the test data into correct_classes
correct_classes <- test$crime

# Removing crime from the test set
library(dplyr)
test <- dplyr::select(test, -crime)

# Verifying the content of test and correct_classes
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  0.2845 0.0487 0.0487 -0.4872 -0.4872 ...
##  $ indus  : num  -1.287 -0.476 -0.476 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.265 -0.265 -0.144 -0.144 ...
##  $ rm     : num  0.413 0.131 -0.392 -0.268 -0.419 ...
##  $ age    : num  -0.12 0.914 0.509 0.566 0.466 ...
##  $ dis    : num  0.14 1.212 1.155 0.317 0.22 ...
##  $ rad    : num  -0.982 -0.522 -0.522 -0.637 -0.637 ...
##  $ tax    : num  -0.666 -0.577 -0.577 -0.601 -0.601 ...
##  $ ptratio: num  -1.46 -1.5 -1.5 1.18 1.18 ...
##  $ black  : num  0.441 0.393 0.441 0.256 0.329 ...
##  $ lstat  : num  -1.0745 1.0918 0.0864 -0.3351 0.2824 ...
##  $ medv   : num  0.16 -0.819 -0.395 -0.471 -0.547 ...
colnames(test)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
str(correct_classes)
##  Factor w/ 4 levels "low","med_low",..: 1 2 2 3 3 3 2 2 1 2 ...

 

summary(test)
##        zn              indus               chas          
##  Min.   :-0.4872   Min.   :-1.43094   Min.   :-0.272329  
##  1st Qu.:-0.4872   1st Qu.:-0.87558   1st Qu.:-0.272329  
##  Median :-0.4872   Median :-0.16425   Median :-0.272329  
##  Mean   : 0.0338   Mean   : 0.05866   Mean   :-0.002136  
##  3rd Qu.: 0.3703   3rd Qu.: 1.01499   3rd Qu.:-0.272329  
##  Max.   : 3.8005   Max.   : 2.42017   Max.   : 3.664771  
##       nox                 rm                age           
##  Min.   :-1.42991   Min.   :-2.51294   Min.   :-2.205237  
##  1st Qu.:-0.91860   1st Qu.:-0.55882   1st Qu.:-0.804647  
##  Median :-0.14407   Median :-0.10622   Median : 0.293976  
##  Mean   : 0.01218   Mean   : 0.01586   Mean   : 0.004827  
##  3rd Qu.: 0.57651   3rd Qu.: 0.41042   3rd Qu.: 0.910342  
##  Max.   : 2.72965   Max.   : 2.97511   Max.   : 1.116390  
##       dis               rad               tax           
##  Min.   :-1.1746   Min.   :-0.9819   Min.   :-1.312691  
##  1st Qu.:-0.8194   1st Qu.:-0.6373   1st Qu.:-0.772751  
##  Median :-0.2869   Median :-0.5225   Median :-0.452346  
##  Mean   :-0.0383   Mean   :-0.0068   Mean   :-0.008329  
##  3rd Qu.: 0.4422   3rd Qu.: 1.2002   3rd Qu.: 1.529413  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.796416  
##     ptratio             black              lstat         
##  Min.   :-2.70470   Min.   :-3.86686   Min.   :-1.33637  
##  1st Qu.:-1.13422   1st Qu.: 0.25071   1st Qu.:-0.73526  
##  Median : 0.06673   Median : 0.39445   Median :-0.31061  
##  Mean   :-0.19321   Mean   : 0.06406   Mean   :-0.02708  
##  3rd Qu.: 0.80578   3rd Qu.: 0.43873   3rd Qu.: 0.57862  
##  Max.   : 1.63721   Max.   : 0.44062   Max.   : 3.40663  
##       medv         
##  Min.   :-1.90634  
##  1st Qu.:-0.57440  
##  Median :-0.10142  
##  Mean   : 0.02639  
##  3rd Qu.: 0.26282  
##  Max.   : 2.98650

 

PREDICTING CLASSES WITH TEST DATA

In this part I’m going to predict the classes with the test data. In addition to our DataCamp exercises, I used the actual confusionMatrix() command to illustrate the results more clearly, which is why the table of predicted values actually appears twice. I kept the first table because I wanted to be sure that I had managed to code the confusion matrix correctly.

From the confusion matrix we can see that the estimated accuracy of the final model is 74.5%. You can see the same result in the table where the correct classes are cross-tabulated against the predicted classes. If you like, you can also calculate the accuracy by hand: the sum of correctly predicted values divided by the total number of values.

 

# Predicting classes with the test data
lda.pred <- predict(lda.fit, newdata = test)
lda.pred$class
##   [1] low      med_low  med_low  med_high med_high med_high med_low 
##   [8] med_low  low      med_low  med_low  med_low  med_low  low     
##  [15] med_low  med_low  med_high med_high med_low  med_high med_high
##  [22] med_high med_high med_high med_high med_high med_high med_high
##  [29] med_high med_high med_high med_high med_low  med_low  med_high
##  [36] med_low  med_high low      low      low      low      med_low 
##  [43] med_low  med_low  med_high med_low  low      med_low  med_low 
##  [50] low      low      med_high med_high med_high low      low     
##  [57] low      low      low      med_low  med_low  low      low     
##  [64] low      med_low  med_high low      med_low  med_low  low     
##  [71] low      high     high     high     high     high     high    
##  [78] high     high     high     high     high     high     high    
##  [85] high     high     high     high     high     high     high    
##  [92] high     high     high     high     high     high     med_low 
##  [99] med_high med_high med_high med_high
## Levels: low med_low med_high high
# Cross tabulating the result
table1 <- table(correct = correct_classes, predicted = lda.pred$class)
table1
##           predicted
## correct    low med_low med_high high
##   low       14       5        2    0
##   med_low    7      15        5    0
##   med_high   0       6       22    1
##   high       0       0        0   25
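# The accuracy can also be verified by hand from table1: the sum of the
# diagonal (correctly predicted values) divided by the total number of values
sum(diag(table1)) / sum(table1)
## [1] 0.745098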
# Note: some of these packages may be redundant; different sources suggested
# different packages and R then required some others
library(caret); library(lattice); library(mlbench); library(e1071)
# caret expects confusionMatrix(data = predicted, reference = correct), so the
# "Prediction"/"Reference" labels below are reversed (the accuracy is unaffected)
confusionMatrix(correct_classes, lda.pred$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low med_low med_high high
##   low       14       5        2    0
##   med_low    7      15        5    0
##   med_high   0       6       22    1
##   high       0       0        0   25
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7451          
##                  95% CI : (0.6492, 0.8262)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6587          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: low Class: med_low Class: med_high Class: high
## Sensitivity              0.6667         0.5769          0.7586      0.9615
## Specificity              0.9136         0.8421          0.9041      1.0000
## Pos Pred Value           0.6667         0.5556          0.7586      1.0000
## Neg Pred Value           0.9136         0.8533          0.9041      0.9870
## Prevalence               0.2059         0.2549          0.2843      0.2549
## Detection Rate           0.1373         0.1471          0.2157      0.2451
## Detection Prevalence     0.2059         0.2647          0.2843      0.2451
## Balanced Accuracy        0.7901         0.7095          0.8314      0.9808
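As mentioned earlier, LDA assigns each observation to the class with the highest estimated probability. These probabilities can be inspected directly, since the output of MASS’s predict() also contains a posterior matrix:

# The posterior class probabilities behind the predictions; each row sums to 1
# and the predicted class is the column with the highest probability
head(round(lda.pred$posterior, 3))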

 

4.6 Clustering

 

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a dataset. When we cluster the observations of a dataset, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.

In this chapter I’m going to reload the Boston dataset for clustering purposes. First I’m going to scale the variables to get comparable distances between the observations. After that I’m going to run the k-means algorithm and investigate the optimal number of clusters.

K-means clustering is an unsupervised learning algorithm, which means that there is no outcome to be predicted; instead, the algorithm tries to find patterns in the data.
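As a minimal sketch of the k-means idea itself (a toy example, not part of the course code): choose k initial centers, assign each point to its nearest center, recompute the centers as cluster means, and repeat until the assignments stabilize.

# A toy base-R illustration of the k-means iteration; a real analysis should
# use kmeans(), which also handles edge cases such as empty clusters
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)    # 20 points, 2 variables
k <- 2
centers <- x[sample(nrow(x), k), ]  # initial centers picked from the data
repeat {
  # distances from every point to every center
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  cluster <- max.col(-d)            # index of the nearest center for each point
  new_centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  if (all(abs(new_centers - centers) < 1e-8)) break
  centers <- new_centers
}
table(cluster)                      # cluster sizes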

 

RELOADING THE BOSTON DATASET

reboston <- Boston
str(reboston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

 

SUMMARY OF THE UNSCALED BOSTON DATASET

summary(reboston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

 

SUMMARY OF SCALED BOSTON DATASET

reboston_scaled <- scale(reboston)
summary(reboston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

 

CALCULATING THE DISTANCES

Here I’m going to calculate the Euclidean distance, which is the ordinary straight-line distance between two points. It is a very common measure of the similarity or dissimilarity of objects.
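For two observations \(x\) and \(y\) measured on \(p\) variables, the Euclidean distance is

\(d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}\)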

 

# Euclidean distance
dist_eu <- dist(reboston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000

 

K-MEANS AND INITIAL AMOUNT OF CLUSTERS

K-means is one of the best-known clustering approaches, which is applied to partition the observations into a pre-specified number of clusters.

In this section I present some k-means pair plots based on the Euclidean distances with a couple of arbitrary numbers of centers.

 

km_eu <- kmeans(dist_eu, centers = 2)
pairs(reboston_scaled, col = km_eu$cluster, main = "Pairs with 2 clusters")

km_eu2 <- kmeans(dist_eu, centers = 4)
pairs(reboston_scaled, col = km_eu2$cluster, main = "Pairs with 4 clusters")

 

DETERMINING THE NUMBER OF CLUSTERS

Here the purpose is to find the optimal number of clusters. The idea behind k-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible. There are different methods to achieve this goal, and I’m going to use the total within-cluster sum of squares (TWCSS), which can be calculated with the formula below:

\(TWCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2\)

where \(\mu_k\) is the centroid (mean) of cluster \(C_k\).

 

The code below is adopted from the DataCamp exercises as such. The task is to find an appropriate number of clusters. The DataCamp guidelines advised choosing the number of clusters where the total WCSS drops radically. So if you look at the figure below, the most radical drop happens after the first number, suggesting that the optimal amount could be 2 clusters.

On the other hand, statistical textbooks (e.g. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf) suggest that the clustering results may be optimal when the within-cluster variation is as small as possible.

I thought I could somehow combine the above advice, and this is what I concluded: the second “cluster point” doesn’t reflect the smallest variation, even though there is a radical change after the first point. There seem to be some drops between the second and sixth point, but between the fifth and sixth the within-cluster variation doesn’t seem to differ much anymore. So you may already guess that I’m suggesting 5 clusters. But I wouldn’t call this a final decision without using other tools and methods.

 

set.seed(123) # keeping the results same
k_max <- 10 # maximum number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b', main = "Number of proposed clusters")
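The tot.withinss component used in the sapply call above is this same total within-cluster sum of squares. As a minimal check for a single k (note that this sketch runs kmeans on the scaled data itself rather than on the distance matrix):

set.seed(123)
km <- kmeans(reboston_scaled, centers = 2)
km$tot.withinss  # total within-cluster sum of squares for k = 2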

 

USING THE ELBOW METHOD TO DETERMINE THE NUMBER OF CLUSTERS

Then there is the so-called Elbow method, which also uses the within-cluster variance. You can find the method and the algorithm here. On that site the author suggests that “the location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters”.

Below you can see two plots using the Elbow method. I have to say that I’m not sure anymore what the “correct” number of clusters could be. Here I would rather say 2 than 5. Note that I drew the dashed lines just to compare the results.

But I’m not happy with the results yet. So let’s take a look at one more method after this.

 

set.seed(123)

# Computing and plotting wss for k = 2 to k = 10
k.max <- 10 # Maximal number of clusters
data <- reboston_scaled
wss <- sapply(1:k.max, 
        function(k){kmeans(data, k, nstart=10 )$tot.withinss})
plot(1:k.max, wss,
       type="b", pch = 1, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares", main = "The amount of clusters")
abline(v = 2, lty =2)

# Another functions, same method
library(factoextra); require(ggplot2)
fviz_nbclust(reboston_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2)

 

USING THE “ALL VALIDATION INDICES” TO DETERMINE THE NUMBER OF CLUSTERS

This method is also adopted from the website I introduced above. It might be a “heavy” method since it computes all the 30 indices for determining the optimal number of clusters (see ?NbClust for additional information). Note that the option index = "all" actually drops some of the indices for the sake of faster computing.

So according to the majority rule of this method, the best number of clusters is 2.

 

library(NbClust)

nb <- NbClust(reboston_scaled, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index ="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 1 proposed 3 as the best number of clusters 
## * 6 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
nb
## $All.index
##          KL       CH Hartigan      CCC    Scott      Marriot    TrCovW
## 2  365.8858 138.3799  30.2810 -10.6105  546.752 3.282035e+33 379348.85
## 3    0.0051  88.3119  68.0689 -16.2831 1135.949 2.304751e+33 344603.04
## 4    0.4190  89.3535 157.2726 -14.9359 1713.232 1.309260e+33 251173.09
## 5    3.3113 127.0749  59.2562  -0.9093 2860.839 2.117710e+32 120863.36
## 6    4.9882 125.2845  20.4741   3.8642 3531.911 8.095706e+31  97956.82
## 7    0.6595 111.8671  24.1067   4.0150 3820.466 6.229953e+31  89795.14
## 8    0.2506 103.7538  73.4329   5.0250 4307.836 3.105734e+31  85970.73
## 9  121.0692 113.1228   7.0030  13.6910 4924.178 1.162705e+31  59638.61
## 10   0.1218 102.5418  11.2077  12.4447 5033.461 1.156612e+31  58526.58
##      TraceW Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2
## 2  5546.998  11.6512 1.2746 0.3501 1.8755     0.1927 0.9033  32.2200
## 3  5232.616  15.3777 1.3511 0.4039 1.5318     0.1870 0.7283  74.9817
## 4  4608.911  20.8047 1.5340 0.4425 1.6959     0.1720 0.6174 182.7857
## 5  3509.433  45.3481 2.0146 0.3677 1.7257     0.2152 0.6582  51.4077
## 6  3138.254  49.5877 2.2528 0.3991 1.4939     0.2251 0.7893  26.6922
## 7  3014.803  52.4833 2.3451 0.4023 1.5854     0.1980 0.7194  28.4728
## 8  2875.870  60.0652 2.4584 0.4049 1.4818     0.2014 0.6649  90.2336
## 9  2506.302  67.5566 2.8209 0.3635 1.4674     0.2147 0.4475   4.9376
## 10 2471.477  68.5392 2.8606 0.3873 1.3914     0.2126 0.7514   7.9406
##     Beale Ratkowsky      Ball Ptbiserial    Frey McClain   Dunn Hubert
## 2  1.0250    0.3193 2773.4992     0.3052 -0.7151  0.7265 0.0476  2e-04
## 3  3.5661    0.2869 1744.2052     0.3539  0.4543  0.7378 0.0563  3e-04
## 4  5.9326    0.2875 1152.2277     0.3723  0.1353  0.9946 0.0641  3e-04
## 5  4.9388    0.3104  701.8866     0.4607 -0.0405  1.7679 0.0644  4e-04
## 6  2.5390    0.2997  523.0423     0.4851  0.7911  1.8311 0.0731  5e-04
## 7  3.6965    0.2821  430.6862     0.4754 -0.0785  1.9927 0.0747  5e-04
## 8  4.8160    0.2683  359.4838     0.4819  0.2051  1.9994 0.0761  5e-04
## 9  9.4871    0.2641  278.4780     0.4892 -0.0289  2.4723 0.0766  5e-04
## 10 3.0515    0.2518  247.1477     0.4893  0.0608  2.4725 0.0815  5e-04
##    SDindex Dindex   SDbw
## 2   1.4091 3.1092 0.9516
## 3   1.6118 3.0485 1.1166
## 4   1.6525 2.8619 1.0039
## 5   1.7602 2.4429 0.9631
## 6   1.5711 2.3173 0.8411
## 7   1.6287 2.2698 0.7837
## 8   1.6017 2.2312 0.8377
## 9   1.5912 2.0942 0.7618
## 10  1.4766 2.0818 0.6648
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.8871            38.3134       0.4243
## 3          0.8721            29.4697       0.0000
## 4          0.8864            37.8040       0.0000
## 5          0.8377            19.1787       0.0000
## 6          0.8383            19.2906       0.0013
## 7          0.8190            16.1362       0.0000
## 8          0.8673            27.3952       0.0000
## 9          0.4753             4.4164       0.0000
## 10         0.7243             9.1356       0.0002
## 
## $Best.nc
##                       KL       CH Hartigan    CCC    Scott      Marriot
## Number_clusters   2.0000   2.0000   5.0000  9.000    5.000 5.000000e+00
## Value_Index     365.8858 138.3799  98.0165 13.691 1147.607 9.666747e+32
##                   TrCovW   TraceW Friedman   Rubin Cindex      DB
## Number_clusters      5.0   5.0000   5.0000  9.0000 2.0000 10.0000
## Value_Index     130309.7 728.2982  24.5433 -0.3228 0.3501  1.3914
##                 Silhouette   Duda PseudoT2 Beale Ratkowsky     Ball
## Number_clusters     6.0000 2.0000     2.00 2.000    2.0000    3.000
## Value_Index         0.2251 0.9033    32.22 1.025    0.3193 1029.294
##                 PtBiserial Frey McClain    Dunn Hubert SDindex Dindex
## Number_clusters    10.0000    1  2.0000 10.0000      0  2.0000      0
## Value_Index         0.4893   NA  0.7265  0.0815      0  1.4091      0
##                    SDbw
## Number_clusters 10.0000
## Value_Index      0.6648
## 
## $Best.partition
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2   2   2   2 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   2   2   2   1   1   2   2   2   2   2   2   2   2   2   1   1   1   1 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   1   1   1   1   1   1   2   2   1   1   1   1   1   1   1   1   2   2 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   2   2   2   2   1   2   2   2   2   1   2   2   2   2   2   2   1   1 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 
##   1   1   2   2   2   1   1   1   1   1   2   2   2   2   2   2   2   2 
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
##   2   2   2   1   2   2   1   2   1   1   2   2   2   2   2   2   2   2 
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   1   2 
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 
##   2   2   2   2   2   2   2   2   1   2   1   1   2   1   2   2   1   1 
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   2   2   1   2   2   2   2   2   2   1   1   1   1   1   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 
##   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 
##   1   1   1   1   1   1   1   2   2   2   1   1   1   1   1   2   2   2 
## 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 
##   1   1   1   1   1   1   1   1   1   1   2   2   1   1   1   1   1   1 
## 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 
##   2   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 
##   1   1   1   1   1   2   2   2   2   2   1   1   1   1   1   1   1   1 
## 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 
##   1   1   1   2   2   1   2   1   1   2   2   2   1   1   2   2   2   2 
## 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 
##   2   2   2   2   2   2   2   1   1   2   2   2   2   2   2   2   2   1 
## 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 
##   2   2   2   1   1   2   2   2   1   1   1   1   1   2   2   2   2   2 
## 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 505 506 
##   2   2
# Visualizing the results
fviz_nbclust(nb) + theme_minimal()
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 1 proposed  3 as the best number of clusters
## * 6 proposed  5 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 2 proposed  9 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

 

Here you can take another look at the pairs matrix with the two clusters.

 

# Let's plot the pairs with two centers again

pairs(reboston_scaled, col = km_eu$cluster, main = "Pairs with 2 clusters")

 


5 Dimensionality reduction techniques

 

In this chapter the purpose is to get familiar with some dimensionality reduction techniques, including Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA).

Principal Component Analysis is a way of identifying patterns in data and expressing the data in a way that highlights their similarities and differences. After finding the patterns, PCA can compress the data by reducing the number of dimensions without much loss of information. Additional info of PCA

Multiple Correspondence Analysis is used for the same purpose as PCA, but it can be applied when the data contains categorical variables. Additional info of MCA
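Under the hood, MCA applies a PCA-like decomposition to the indicator (one-hot) matrix of the categories. Here is a minimal base-R sketch of that matrix, using an invented two-factor table (tea_type and sugar are made up purely for illustration):

```r
# MCA works on the indicator (one-hot) matrix of the categories,
# then applies a PCA-like decomposition to it.
toy <- data.frame(
  tea_type = factor(c("green", "black", "black", "green")),
  sugar    = factor(c("yes", "no", "no", "yes"))
)

# Expand every factor level into its own 0/1 column
# (contrasts.arg keeps ALL levels instead of dropping a reference level)
ind <- model.matrix(~ . - 1, data = toy,
                    contrasts.arg = lapply(toy, contrasts, contrasts = FALSE))
ind
```

Each row of the indicator matrix sums to the number of categorical variables, since every observation picks exactly one level per factor.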

 

5.1 Brief description of the dataset

 

In this chapter I will be working with the Human Development data adopted from the United Nations Development Programme, UNDP. The programme has explored human development around the world with specific indices every year since 1990, Country List in 2015. The specified indices are:

  1. the Human Development Index, HDI
  2. the Inequality-Adjusted Human Development Index, IHDI
  3. the Gender Development Index, GDI
  4. the Gender Inequality Index, GII
  5. the Multidimensional Poverty Index, MPI


I’m going to work with HDI and GII indices by combining the datasets and using specific variables including:

  1. country - technically a rowname
  2. SFM_edu = SFedu / SMedu, where
    • SFedu - Females, Population with at least some secondary education (% ages 25 and older)
    • SMedu - Males, Population with at least some secondary education (% ages 25 and older)
  3. FMlabour = Flabour / Mlabour, where
    • Flabour - Females, Labour force participation rate (% ages 15 and older)
    • Mlabour - Males, Labour force participation rate (% ages 15 and older)
  4. Eedu - Expected years of schooling per country
  5. Elife - Expected life years at birth per country
  6. GNIpc - Gross national income per capita per country in PPP (https://en.wikipedia.org/wiki/Purchasing_power_parity)
  7. mort_ratio - Maternal mortality ratio (deaths per 100,000 live births)
  8. ad_birth - Adolescent birth rate (births per 1,000 women ages 15-19)
  9. seats - Share of seats in parliament (% held by women)


Note 1, SFM_edu is an education ratio: values less than 1 signify that males are more educated than females, and values greater than 1 imply that females are more educated than males.

Note 2, FMlabour is a labour force participation ratio: values less than 1 signify that males participate more than females, and values greater than 1 imply that females participate more than males.

Note 3, variables from the Human Development Index data are Eedu, Elife and GNIpc while the other variables are from the Gender Inequality Index data.
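The two ratio variables above can be sketched in code. The raw column names SFedu, SMedu, Flabour and Mlabour come from the variable list, but the gii data frame below and its values are invented stand-ins for the actual Gender Inequality Index data:

```r
# Hypothetical slice of the GII data (values invented for illustration)
gii <- data.frame(
  SFedu   = c(95.0, 40.2),  # % of females with at least some secondary education
  SMedu   = c(94.3, 57.1),  # % of males with at least some secondary education
  Flabour = c(55.7, 20.1),  # female labour force participation rate
  Mlabour = c(68.9, 72.4)   # male labour force participation rate
)

# Ratios below 1 mean males dominate; above 1 means females do
gii$SFM_edu  <- gii$SFedu / gii$SMedu
gii$FMlabour <- gii$Flabour / gii$Mlabour
round(gii[, c("SFM_edu", "FMlabour")], 3)
```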

 


For additional information you can find here the Human Development Index data and its components.

For additional information you can find here the Gender Inequality Index data and its components.


 

COMBINED DATASET

You can find the combined dataset in my GitHub repository: https://github.com/8475394/IODS-project/blob/master/data/human_1.txt

 

library(tidyr); library(readr)
human_own <- read.table(file.path("C:/Users/heidi/Documents/YLIOPISTO/TILASTOTIEDE/INTRODUCTION TO OPEN DATA SCIENCE/IODS-project/data", "human_1.txt"), header = TRUE)
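The path above points to my local copy. As an alternative, the same file can probably be read straight from the repository; note that the raw URL below is derived from the GitHub link above, so it is an assumption that it still resolves:

```r
# Read the combined dataset directly from GitHub
# (raw.githubusercontent.com version of the repo file - assumed URL)
human_own <- read.table(
  "https://raw.githubusercontent.com/8475394/IODS-project/master/data/human_1.txt",
  header = TRUE
)
```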

 

STRUCTURE OF THE HUMAN DATASET

From the structure table you can notice that there are 155 observations and 8 variables. Note that the country is a rowname, so it doesn’t appear in the variable list.

 

str(human_own)
## 'data.frame':    155 obs. of  8 variables:
##  $ SFM_edu   : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ FMlabour  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Eedu      : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Elife     : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNIpc     : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mort_ratio: int  4 6 6 5 6 7 9 28 11 8 ...
##  $ ad_birth  : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ seats     : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

 

5.2 Summaries and graphs

 

In this section I’m going to explore the distributions of the data variables further. First I show a summary table with the distribution of each variable, divided into quartiles together with the minimum and maximum values.

After that I’ll show a set of histograms and explore the potential relationships between the variables.

 

SUMMARY TABLE OF DATA VARIABLES

You may spot some interesting “phenomena” in the summary table. Both the education ratio (SFM_edu) and the labour force participation ratio (FMlabour) suggest that males dominate in these fields. It seems that females are closer to parity in education than in labour force participation. You can examine this visually in the histograms later in this chapter.

According to the Technical notes of the Human Development indices, the scale of expected years of schooling per country (Eedu) runs from 0 to 18, where the latter is equivalent to achieving a master’s degree in most countries. However, you can see that the actual maximum of the schooling distribution is a bit over 20. If I understood correctly, the 0-18 range is used when computing the overall Human Development Index. Nevertheless, I think the average of expected schooling years is quite good, even though these values don’t tell us anything about specific countries.

I can’t help but wonder about the minimum value of expected life years at birth per country (Elife). 49 years seems very low, even though the mean and median values are much greater.

I think the distribution of Gross National Income per capita PPP emphasizes the huge differences between countries and their wealth. Note that GNI per capita in PPP measures purchasing power parity, which is useful for making comparisons between countries, PPP. However, there is a caveat when measuring the Human Development Index: if you look at the table of HDI (as seen earlier in this chapter), you can see that GNI is not necessarily the most critical measure of well-being. For example, the GNI of Montenegro is lower than average, but its Human Development Index is very high. Note also that this is additional information, since the actual HD index is not part of this examination.

Based on the technical notes, the maternal mortality ratio represents maternal health: the higher the ratio, the poorer the maternal health. If you look at the table of GII (as seen earlier in this chapter), you can see that the maximum maternal mortality among countries with very high human development is 69. This is just to give you some sense of the scale.

Adolescent birth rate represents the number of births per 1,000 women between ages 15 and 19. This scale also varies a lot. The distribution is skewed to the right (positive skewness), suggesting that lower values are more common than higher ones.

The seats distribution represents the share (%) of parliamentary seats held by women in each country. You can see, for example, that it is not common to have more females than males in parliaments, as the median and mean are both close to 20 %. There are also just a few countries where women are in the majority.
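The right-skewness claim above can be checked numerically: in a right-skewed distribution the mean is pulled above the median. Here is a self-contained sketch on a simulated right-skewed sample (the rexp draw with a mean near the ad_birth mean is my own illustration, not part of the data):

```r
# Right skewness pulls the mean above the median; a quick numeric check
set.seed(42)
skewed <- rexp(1000, rate = 1 / 47)  # right-skewed sample, mean near the ad_birth mean

mean(skewed)    # clearly larger than...
median(skewed)  # ...the median

# Moment-based skewness coefficient (positive means right-skewed)
g1 <- mean((skewed - mean(skewed))^3) / sd(skewed)^3
g1 > 0
```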

 

library(knitr)
kable(summary(human_own), format = "pandoc", digits = 2,  align="l")
            SFM_edu   FMlabour   Eedu    Elife   GNIpc    mort_ratio   ad_birth   seats
---------   -------   --------   -----   -----   ------   ----------   --------   -----
Min.        0.1717    0.1857      5.40   49.00      581       1.0         0.60     0.00
1st Qu.     0.7264    0.5984     11.25   66.30     4198      11.5        12.65    12.40
Median      0.9375    0.7535     13.50   74.20    12040      49.0        33.60    19.30
Mean        0.8529    0.7074     13.18   71.65    17628     149.1        47.16    20.91
3rd Qu.     0.9968    0.8535     15.20   77.25    24512     190.0        71.95    27.95
Max.        1.4967    1.0380     20.20   83.50   123124    1100.0       204.80    57.50

 

HISTOGRAMS OF EDUCATION, LABOUR FORCE, YEARS OF SCHOOLING AND LIFESPAN


library(ggplot2); library(gridExtra)

# INITIALIZING THE PLOTS

h1 <- qplot(human_own$SFM_edu, geom = "histogram", binwidth = 0.1, main = "Population with at least sec. school", xlab = "ratio of education, female/male", ylab = "number of countries", col = I("grey"), fill = I("chartreuse3"))

h2 <- qplot(human_own$FMlabour, geom = "histogram", binwidth = 0.08, main = "Labour force participation rate", xlab = "ratio of labour force, female/male", ylab = "number of countries", col = I("grey"), fill = I("orange"))

h3 <- qplot(human_own$Eedu, geom = "histogram", binwidth = 1.2, main = "Expected years of schooling per country", xlab = "years per country", ylab = "number of countries", col = I("grey"), fill = I("mediumorchid"))

h4 <- qplot(human_own$Elife, geom = "histogram", binwidth = 4, main = "Expected life years at birth per country", xlab = "years per country", ylab = "number of countries", col = I("grey"), fill = I("blue"))

# Combining the plots using the function grid.arrange()

grid.arrange(h1, h2, h3, h4, ncol=2, nrow =2)

 

HISTOGRAMS OF GNI, MATERNAL MORTALITY, ADOLESCENT BIRTH RATE AND SHARE OF SEATS


library(ggplot2); library(gridExtra)

# INITIALIZING THE PLOTS

h5 <- qplot(human_own$GNIpc, geom = "histogram", binwidth = 10000, main = "Gross National Income, GNI", xlab = "GNI per capita PPP", ylab = "number of countries", col = I("grey"), fill = I("red"))

h6 <- qplot(human_own$mort_ratio, geom = "histogram", binwidth = 50, main = "Maternal mortality ratio", xlab = "deaths per 100 000 live births", ylab = "number of countries", col = I("grey"), fill = I("cornflowerblue"))

h7 <- qplot(human_own$ad_birth, geom = "histogram", binwidth = 10, main = "Adolescent birth rate", xlab = "births per 1000 women ages 15-19", ylab = "number of countries", col = I("grey"), fill = I("yellow"))

h8 <- qplot(human_own$seats, geom = "histogram", binwidth = 3, main = "Share of seats in parliament", xlab = "% held by women", ylab = "number of countries", col = I("grey"), fill = I("violetred2"))

# Combining the plots using the function grid.arrange()

grid.arrange(h5, h6, h7, h8, ncol=2, nrow =2)

 

 

INITIAL RESULTS OF THE RELATIONSHIPS

Here you can take an initial look at the relationships between the human data variables. I added a regression line to each scatterplot to make the direction of the relationships easier to perceive. Next we can take a closer look at the actual correlation values.


library(GGally); library(ggplot2)
g = ggpairs(human_own, lower = list(continuous = wrap("smooth", method = "lm", color="red")), upper = list(continuous = wrap("density", color = "blue")))
g

 

 

CORRELATION MATRIX BETWEEN DATA VARIABLES

Here I’m going to explore the linear relationships between the variables more closely to get a better understanding of the data. You can see the correlation coefficient of each pair in the table below. It is interesting that there is almost zero correlation between SFM_edu (ratio of education, F/M) and FMlabour (ratio of labour force, F/M); initially I thought there might be at least some linear relationship between these variables. However, SFM_edu has other interesting relationships, for example with the maternal mortality ratio (mort_ratio) and the adolescent birth rate (ad_birth). These relationships are negative, suggesting that when the ratio of female education is high, maternal mortality and adolescent births are correspondingly low. The same logic applies between expected years of schooling per country (Eedu) and mort_ratio/ad_birth, as well as between expected life years at birth per country (Elife) and mort_ratio/ad_birth.

Interestingly, gross national income per capita (GNIpc) has a positive relationship with expected years of schooling per country (Eedu) and expected life years at birth per country (Elife). It may be a natural consequence that a wealthy country can invest more in education. But I think there may be other components affecting the lifespan of the population, even though there is a positive correlation between GNI and life years.

If you look at the “Seats” column (or row), i.e. share of seats in parliament (% held by women), there are no high correlations, though it is related to FMlabour and Eedu to some extent. The former may suggest that when the ratio of female labour force participation is high, the number of parliamentary seats held by women can also be high, and vice versa. The latter may imply that when the expected years of schooling per country are high, the number of seats held by women can also be high. Here I can imagine that if a country has many expected schooling years, there would also be more educated women among the population to be voted into parliament. Note that the latter viewpoint was just “an educated guess”!

After the correlation matrix you can see a more visual “correlation matrix”.

 

library(tidyverse); library(knitr)
cor_matrix <- cor(human_own) %>% round(digits = 2)
kable(cor_matrix, format = "pandoc", digits = 2,  align="c")
             SFM_edu   FMlabour   Eedu    Elife   GNIpc   mort_ratio   ad_birth   seats
----------   -------   --------   -----   -----   -----   ----------   --------   -----
SFM_edu        1.00      0.01      0.59    0.58    0.43      -0.66       -0.53     0.08
FMlabour       0.01      1.00      0.05   -0.14   -0.02       0.24        0.12     0.25
Eedu           0.59      0.05      1.00    0.79    0.62      -0.74       -0.70     0.21
Elife          0.58     -0.14      0.79    1.00    0.63      -0.86       -0.73     0.17
GNIpc          0.43     -0.02      0.62    0.63    1.00      -0.50       -0.56     0.09
mort_ratio    -0.66      0.24     -0.74   -0.86   -0.50       1.00        0.76    -0.09
ad_birth      -0.53      0.12     -0.70   -0.73   -0.56       0.76        1.00    -0.07
seats          0.08      0.25      0.21    0.17    0.09      -0.09       -0.07     1.00

 

PLOTTING THE CORRELATION MATRIX


library(corrplot)
corrplot(cor_matrix, method = "circle", tl.cex = 0.8, addCoef.col = "black")

 

5.3 Principal Component Analysis, PCA, with the NOT standardized data

 

Here I’m going to perform principal component analysis (PCA) without standardizing the Human data. Note that the PCA is performed again in the next chapter, but with standardized data. The purpose is to see and interpret whether there are any differences between these models, and my guess is that the actual point is to show and justify why the data has to be standardized. So let’s take a look at the first model.

There are two slightly different methods for performing PCA: eigenvalue decomposition and singular value decomposition. According to our instructions in DataCamp, the latter is preferred.
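The two routes agree up to sign flips of the component directions. Here is a minimal sketch comparing prcomp() (the SVD route) with an explicit eigendecomposition of the covariance matrix, using the built-in iris measurements purely as a stand-in dataset:

```r
# PCA two ways on a standardized numeric dataset
x <- scale(as.matrix(iris[, 1:4]))

# 1) prcomp() uses the singular value decomposition of the data matrix
p_svd <- prcomp(x)

# 2) The eigenvalue decomposition of the covariance matrix gives the same answer
e <- eigen(cov(x))

# Component standard deviations agree...
round(p_svd$sdev - sqrt(e$values), 10)

# ...and the loadings agree up to sign flips of whole columns
round(abs(p_svd$rotation) - abs(e$vectors), 10)
```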

 

SUMMARY TABLE AS A REMINDER AND FOR COMPARING THESE RESULTS WITH RESULTS OF THE NEXT CHAPTER

library(knitr)
kable(summary(human_own), format = "pandoc", digits = 2,  align="l")
            SFM_edu   FMlabour   Eedu    Elife   GNIpc    mort_ratio   ad_birth   seats
---------   -------   --------   -----   -----   ------   ----------   --------   -----
Min.        0.1717    0.1857      5.40   49.00      581       1.0         0.60     0.00
1st Qu.     0.7264    0.5984     11.25   66.30     4198      11.5        12.65    12.40
Median      0.9375    0.7535     13.50   74.20    12040      49.0        33.60    19.30
Mean        0.8529    0.7074     13.18   71.65    17628     149.1        47.16    20.91
3rd Qu.     0.9968    0.8535     15.20   77.25    24512     190.0        71.95    27.95
Max.        1.4967    1.0380     20.20   83.50   123124    1100.0       204.80    57.50

 

PERFORMING PCA WITH SINGULAR VALUE DECOMPOSITION METHOD

 

# Performing the PCA with Singular Value Decomposition (SVD) method
pca_human1 <- prcomp(human_own)
s1 <- summary(pca_human1)
s1
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# Drawing a biplot of the principal component representation and the original variables
biplot(pca_human1, choices = 1:2, cex = c(0.8, 1), col = c("dimgrey", "deeppink"))

 

ADJUSTING THE CODE TO GET THE PERCENTAGES OF VARIANCE BY EACH PCA

 

# rounded percentages of variance captured by each principal component
pca_pr1 <- round(100*s1$importance[2, ], digits = 1)
print(pca_pr1)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# creating object pc_lab to be used as axis labels
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
pc_lab1
## [1] "PC1 (100%)" "PC2 (0%)"   "PC3 (0%)"   "PC4 (0%)"   "PC5 (0%)"  
## [6] "PC6 (0%)"   "PC7 (0%)"   "PC8 (0%)"
# drawing the biplot again with captions
biplot(pca_human1, cex = c(0.8, 1), col = c("dimgrey", "deeppink"), xlab = pc_lab1[1], ylab = pc_lab1[2])
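The lopsided percentages above have a simple explanation: prcomp() without standardization works on the raw covariance matrix, so the variable with the largest variance, here GNIpc with values in the tens of thousands, dominates PC1 almost completely. A minimal sketch of the effect on made-up data (the column names and values are invented):

```r
# Two uncorrelated variables on very different scales
set.seed(1)
toy <- data.frame(
  ratio  = rnorm(100, mean = 1, sd = 0.1),       # scale of SFM_edu
  income = rnorm(100, mean = 17000, sd = 18000)  # scale of GNIpc
)

# Without standardization PC1 is essentially the income column
round(summary(prcomp(toy))$importance[2, ], 3)

# After standardizing, the variance is shared between the components
round(summary(prcomp(scale(toy)))$importance[2, ], 3)
```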

 

5.4 Repeating the PCA with standardized data

 

Here I repeat the principal component analysis, but this time with standardized data, so that every variable contributes on a comparable scale.

 

SUMMARY TABLE OF THE STANDARDIZED HUMAN DATA

# Standardizing the Human data
std_human <- scale(human_own)

library(knitr)
kable(summary(std_human), format = "pandoc", digits = 2,  align="l")
            SFM_edu   FMlabour   Eedu      Elife     GNIpc     mort_ratio   ad_birth   seats
---------   -------   --------   -------   -------   -------   ----------   --------   -------
Min.        -2.8189   -2.6247    -2.7378   -2.7188   -0.9193    -0.6992     -1.1325    -1.8203
1st Qu.     -0.5233   -0.5484    -0.6782   -0.6425   -0.7243    -0.6496     -0.8394    -0.7409
Median       0.3503    0.2316     0.1140    0.3056   -0.3013    -0.4726     -0.3298    -0.1403
Mean         0.0000    0.0000     0.0000    0.0000    0.0000     0.0000      0.0000     0.0000
3rd Qu.      0.5958    0.7350     0.7126    0.6717    0.3712     0.1932      0.6030     0.6127
Max.         2.6646    1.6632     2.4730    1.4218    5.6890     4.4899      3.8344     3.1850
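As the Mean row of the table shows, scale() centers every column to mean 0 and scales it to standard deviation 1. A quick self-contained check on a tiny made-up matrix:

```r
# scale() centers each column to mean 0 and scales it to sd 1, column by column
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
sm <- scale(m)

colMeans(sm)       # both effectively zero
apply(sm, 2, sd)   # both one
```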

 

PERFORMING PCA WITH SINGULAR VALUE DECOMPOSITION METHOD

 

# Performing the PCA with Singular Value Decomposition (SVD) method
pca_human2 <- prcomp(std_human)
s2 <- summary(pca_human2)
s2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# Drawing a biplot of the principal component representation and the original variables
biplot(pca_human2, choices = 1:2, cex = c(0.8, 1), col = c("dimgrey", "deeppink"))

 

ADJUSTING THE CODE TO GET THE PERCENTAGES OF VARIANCE BY EACH PCA

 

# rounded percentages of variance captured by each principal component
pca_pr2 <- round(100*s2$importance[2, ], digits = 1)
print(pca_pr2)
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# creating object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
pc_lab2
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)"  "PC4 (7.6%)"  "PC5 (5.5%)" 
## [6] "PC6 (3.6%)"  "PC7 (2.6%)"  "PC8 (1.3%)"
# drawing the biplot again with captions
biplot(pca_human2, cex = c(0.8, 1), col = c("dimgrey", "deeppink"), xlab = pc_lab2[1], ylab = pc_lab2[2])